Attention mechanism for natural language processing

ABSTRACT

A method may include applying a machine learning model, such as a bidirectional encoder representations from transformers model, trained to generate a representation of a word sequence including a reference word, a first candidate noun, and a second candidate noun. The representation may include a first attention map and a second attention map. The first attention map may include attention values indicative of a strength of various linguistic relationships between the reference word and the first candidate noun. The second attention map may include attention values indicative of a strength of various linguistic relationships between the reference word and the second candidate noun. A natural language processing task, such as determining whether the reference word refers to the first candidate noun or the second candidate noun, may be performed based on the first attention map and the second attention map. Related methods and articles of manufacture are also disclosed.

FIELD

The present disclosure generally relates to machine learning and more specifically to an attention mechanism for natural language processing.

BACKGROUND

Machine learning models may be trained to perform a variety of cognitive tasks. For example, a machine learning model trained to perform natural language processing may classify text by at least assigning, to the text, one or more labels indicating a sentiment, a topic, and/or an intent associated with the text. Training the machine learning model to perform natural language processing may include adjusting the machine learning model to minimize the errors present in the output of the machine learning model. For instance, training the machine learning model may include adjusting the weights applied by the machine learning model in order to minimize a quantity of incorrect labels assigned by the machine learning model.

SUMMARY

Methods, systems, and articles of manufacture, including computer program products, are provided for attention based natural language processing. In one aspect, there is provided a system. The system may include at least one data processor and at least one memory. The at least one memory may store instructions that result in operations when executed by the at least one data processor. The operations may include: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not the second candidate noun; and generating a user interface displaying a result of the natural language processing task.

In some variations, one or more of the features disclosed herein including the following features can optionally be included in any feasible combination. A first masked matrix may be generated by at least applying, to the first attention map, a first binary mask matrix. The first masked matrix may be generated to include an entry having a first attention value in response to the first attention value occupying the entry in the first attention map being greater than a second attention value occupying a corresponding entry in the second attention map. A second masked matrix may be generated by at least applying, to the second attention map, a second binary mask matrix. The second masked matrix may be generated to include a zero value in the corresponding entry in response to the second attention value occupying the corresponding entry in the second attention map being less than the first attention value occupying the entry in the first attention map. A first maximum attention score may be a ratio of a first Hadamard sum associated with the first candidate noun relative to a sum of the first Hadamard sum and a second Hadamard sum associated with the second candidate noun. The first Hadamard sum may be determined based on the first masked matrix. The second Hadamard sum may be determined based on the second masked matrix.

In some variations, the plurality of linguistic relationships may include direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences. The machine learning model may include a plurality of attention heads. Each of the plurality of attention heads may be configured to determine a strength of one of the plurality of linguistic relationships between each pair of words present in the word sequence.

In some variations, the representation of the word sequence may be an attention tensor that includes the first attention map and the second attention map.

In some variations, the machine learning model may be a bidirectional encoder representations from transformers model.

In some variations, the machine learning model may be trained to generate the representation of the word sequence including by training the machine learning model to perform a masked language modeling and/or a next sentence prediction. The machine learning model may be trained, based at least on unlabeled training data, to generate the representation of the word sequence.

In another aspect, there is provided a method for attention based natural language processing. The method may include: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not the second candidate noun; and generating a user interface displaying a result of the natural language processing task.

In some variations, one or more of the features disclosed herein including the following features can optionally be included in any feasible combination. The method may further include: generating a first masked matrix by at least applying, to the first attention map, a first binary mask matrix, the first masked matrix generated to include an entry having a first attention value in response to the first attention value occupying the entry in the first attention map being greater than a second attention value occupying a corresponding entry in the second attention map; and generating a second masked matrix by at least applying, to the second attention map, a second binary mask matrix, the second masked matrix generated to include a zero value in the corresponding entry in response to the second attention value occupying the corresponding entry in the second attention map being less than the first attention value occupying the entry in the first attention map. A first maximum attention score may be a ratio of a first Hadamard sum associated with the first candidate noun relative to a sum of the first Hadamard sum and a second Hadamard sum associated with the second candidate noun. The first Hadamard sum may be determined based on the first masked matrix. The second Hadamard sum may be determined based on the second masked matrix.

In some variations, the plurality of linguistic relationships may include direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences. The machine learning model may include a plurality of attention heads. Each of the plurality of attention heads may be configured to determine a strength of one of the plurality of linguistic relationships between each pair of words present in the word sequence.

In some variations, the representation of the word sequence may be an attention tensor that includes the first attention map and the second attention map.

In some variations, the machine learning model may be a bidirectional encoder representations from transformers model.

In some variations, the method may further include training the machine learning model to generate the representation of the word sequence including by training the machine learning model to perform a masked language modeling and/or a next sentence prediction. The machine learning model may be trained, based at least on unlabeled training data, to generate the representation of the word sequence.

In another aspect, there is provided a computer program product that includes a non-transitory computer readable storage medium. The non-transitory computer-readable storage medium may include program code that causes operations when executed by at least one data processor. The operations may include: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not the second candidate noun; and generating a user interface displaying a result of the natural language processing task.

Implementations of the current subject matter can include methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.

The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to attention mechanisms for natural language processing, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.

DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,

FIG. 1A depicts a network diagram illustrating a machine learning enabled natural language processing system, in accordance with some example embodiments;

FIG. 1B depicts an example of a bidirectional encoder representations from transformers (BERT) model, in accordance with some example embodiments;

FIG. 2A depicts an example of an attention map, in accordance with some example embodiments;

FIG. 2B depicts an example of calculating a maximum attention score for a word sequence, in accordance with some example embodiments;

FIG. 3 depicts examples of coreference resolution, in accordance with some example embodiments;

FIG. 4 depicts a flowchart illustrating a process for performing a natural language processing task, in accordance with some example embodiments; and

FIG. 5 depicts a block diagram illustrating a computing system, in accordance with some example embodiments.

When practical, like labels are used to refer to same or similar items in the drawings.

DETAILED DESCRIPTION

A machine learning model may be trained to perform a natural language processing task by at least subjecting the machine learning model to supervised learning. For example, the machine learning model may be trained to perform coreference resolution, which may include identifying the antecedent of an ambiguous pronoun in a word sequence. However, training the machine learning model for optimal performance may require a large corpus of labeled training samples, each of which includes text and at least one ground-truth label corresponding to a correct label for the text. Because generating a sufficiently large corpus of labeled training samples may require excessive resources, training the machine learning model in a supervised manner may often be impracticable.

As such, in some example embodiments, a machine learning controller may train a machine learning model by at least subjecting the machine learning model to unsupervised training. That is, the machine learning model may be trained based on a corpus of unlabeled training samples, which may include different text without any ground-truth labels. Moreover, the trained machine learning model may generate, for a sequence of words, one or more attention maps indicating a relationship between the words in the sequence of words. For example, each word in the word sequence may be associated with an attention map that includes a plurality of attention values, each of which indicates a strength of a connection to another word in the word sequence. A natural language processing task, such as coreference resolution, may be performed based on the one or more attention maps. For example, the antecedent of an ambiguous pronoun in the word sequence may be identified based at least on the one or more attention maps.

In some example embodiments, the machine learning model may be trained to perform a language modeling task. As such, the machine learning model performing the language modeling task may generate the one or more attention maps as a representation of the word sequence. The machine learning model may be a machine learning model that incorporates an attention mechanism such as, for example, a bidirectional encoder representations from transformers (BERT) model. Unlike a recurrent neural network that processes each word in the word sequence sequentially, the bidirectional encoder representations from transformers model may include a transformer configured to process all of the words in the word sequence simultaneously. Training the machine learning model may include training the machine learning model to perform masked language modeling in which the machine learning model identifies words in a sequence of words that have been masked out. Furthermore, the machine learning model may be trained to perform next sentence prediction and determine whether a first sentence follows a second sentence. As noted, the machine learning model may be trained in an unsupervised manner using a corpus of unlabeled training samples.

FIG. 1A depicts a system diagram illustrating an example of a machine learning enabled natural language processing system 100, in accordance with some example embodiments. Referring to FIG. 1A, the machine learning enabled natural language processing system 100 may include a machine learning controller 110, a natural language processing engine 120, and a client 130. The machine learning controller 110, the natural language processing engine 120, and the client 130 may be communicatively coupled via a network 140. It should be appreciated that the client 130 may be a processor-based device including, for example, a smartphone, a tablet computer, a wearable apparatus, a virtual assistant, an Internet-of-Things (IoT) appliance, and/or the like. The network 140 may be any wired network and/or a wireless network including, for example, a wide area network (WAN), a local area network (LAN), a virtual local area network (VLAN), a public land mobile network (PLMN), the Internet, and/or the like.

In some example embodiments, the machine learning controller 110 may train a machine learning model 115 to perform a language modeling task. The machine learning model 115 performing the language modeling task may generate one or more attention maps to represent a word sequence 150 including, for example, an attention map 125. The machine learning model 115 may be a machine learning model that incorporates an attention mechanism. For example, the machine learning model 115 may be a bidirectional encoder representations from transformers (BERT) model which, as shown in FIG. 1B, includes a transformer configured to process all of the words in the word sequence 150 simultaneously. Moreover, the word sequence 150 may be received from the client 130. For instance, the natural language processing engine 120 may receive, from the client 130, a request to perform a natural language processing task (e.g., coreference resolution) on the word sequence 150.
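
For purposes of illustration only, the following sketch shows one way per-layer, per-head attention maps might be obtained for a word sequence from a pre-trained BERT model. The sketch assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is specified by this disclosure, and the example sentence is merely illustrative.

```python
# Minimal sketch (not part of the disclosure): obtaining attention maps for a
# word sequence from a pre-trained BERT model using the Hugging Face
# "transformers" library. Model name, sentence, and tokenization details are
# illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

word_sequence = "The rabbit hopped over the turtle because it was faster."
inputs = tokenizer(word_sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of L tensors, one per layer, each of shape
# (batch, heads, seq_len, seq_len); stacking them yields a tensor indexed by
# layer l, head h, and token pair (i, j).
attention = torch.stack(outputs.attentions, dim=0).squeeze(1)  # (L, H, T, T)
print(attention.shape)  # e.g., torch.Size([12, 12, T, T]) for BERT-base
```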

Training the machine learning model 115 may include training the machine learning model 115 to perform masked language modeling in which the machine learning model 115 identifies masked out words in various word sequences. Furthermore, the machine learning controller 110 may train the machine learning model 115 to perform next sentence prediction, which may include determining whether a first sentence follows a second sentence. The machine learning model 115 may be trained in an unsupervised manner. That is, the machine learning controller 110 may train the machine learning model 115 based on a corpus of unlabeled training samples, which may include different text without any ground-truth labels.

As noted, the machine learning model 115 may perform the language modeling task and generate the attention map 125 as a representation of the word sequence 150. Moreover, the attention map 125 may include a plurality of attention values, each of which indicates a strength of a connection between two words in the word sequence 150. To further illustrate, FIG. 2A depicts an example of an attention map 200 in which different attention values are visualized as lines having varying intensities. For example, a dark line between two words (e.g., the word “rabbit” and the word “hopped”) may indicate a strong connection between the two words. Contrastingly, a faint line between two words (e.g., the word “rabbit” and the word “turtle”) may indicate a weak connection between the two words. The natural language processing engine 120 may perform, based on at least a portion of the attention map 125, one or more natural language processing tasks. For instance, the natural language processing engine 120 may perform, based at least on the attention map 125, coreference resolution and identify the antecedent of an ambiguous pronoun in the word sequence 150.

In some example embodiments, the natural language processing engine 120 may apply the result of the one or more natural language processing tasks including, for example, the result of the coreference resolution identifying the antecedent of an ambiguous pronoun in the word sequence 150. For example, the natural language processing engine 120 may be deployed as part of a machine learning based communication system such as, for example, a chatbot, an issue tracking system, and/or the like. Accordingly, the natural language processing engine 120 may assign, based at least on the result of the one or more language processing tasks, one or more labels to the word sequence 150 that indicate a sentiment, a topic, and/or an intent associated with the word sequence 150. The natural language processing engine 120 may determine, based at least on the labels assigned to the word sequence 150, one or more responses to the word sequence 150.

In some example embodiments, the attention map 125 may be part of an attention tensor generated by the machine learning model 115. For example, the machine learning model 115 may include multiple layers l∈L, with each layer having multiple attention heads h∈H. Each attention head h included in the machine learning model 115 may correspond to a different linguistic relationship such as, for example, direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences. Accordingly, each attention head h may be configured to generate attention values corresponding to the strength of a particular type of linguistic relationship between different pairs of words in the word sequence 150.

For instance, a first attention head h₁ may generate attention values indicative of how strongly a first word in the word sequence 150 is connected to a second word in the word sequence 150 as a possessive pronoun. A different attention head h₂ may generate attention values indicative of how strongly the first word in the word sequence 150 is connected to the second word in the word sequence 150 as a noun modifier. The attention tensor generated by the machine learning model 115 may include the attention values generated by each of the heads h∈H occupying various layers l∈L of the machine learning model 115.

The natural language processing engine 120 may perform, based at least on the attention tensor, one or more natural language processing tasks. In some example embodiments, the natural language processing engine 120 may perform, based at least on the attention tensor, coreference resolution, which may include identifying the antecedent of an ambiguous pronoun in the word sequence 150. For example, the word sequence 150 may include a reference word s (e.g., a pronoun such as “it”) and an m quantity of candidate nouns C={c₁, . . . , c_(m)} (e.g., a selection of antecedents referenced by the pronoun “it”). A variety of techniques may be applied to identify the m quantity of candidate nouns C={c₁, . . . , c_(m)} including, for example, application and/or interaction specific heuristics. The natural language processing engine 120 may identify, based at least on the attention tensor, a candidate noun c that is referenced by the reference word s.

According to some example embodiments, the natural language processing engine 120 may identify the candidate noun c referenced by the reference word s by at least determining, for each of the m quantity of candidate nouns C={c₁, . . . , c_(m)}, a maximum attention score. For example, the maximum attention score for the candidate noun c may indicate the strength of the connection between the candidate noun c and the reference word s as the antecedent that is being referenced by the reference word s. It should be appreciated that the maximum attention score for the candidate noun c may include the maximum attention values associated with the candidate noun c. For instance, the BERT attention tensor A∈ℝ^(H×L×|C|) may be sliced into several matrices A_(c)∈ℝ^(H×L), each of which corresponds to an attention map between the reference word s and one of the m quantity of candidate nouns C={c₁, . . . , c_(m)}. Each matrix A_(c) may be associated with a binary mask matrix M_(c). The mask value at each location tuple (l, h) included in the binary mask matrix M_(c) is shown in Equation (1) below:

$M_{c}(l,h) = \begin{cases} 1 & \text{if } \operatorname{argmax}\; A(l,h) = c \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

Applying the binary mask matrix M_(c) to a matrix A_(c) may eliminate non-maximum attention values from the matrix A_(c). That is, mask values in the binary mask matrix M_(c) may be non-zero only at locations where the candidate noun c is associated with maximum attention (e.g., a greater attention value than other candidate nouns). Accordingly, the binary mask matrix M_(c) may be applied to the attention matrix A_(c) to limit the impact of attention and focus on the most salient parts. Given the matrix A_(c) and the binary mask matrix M_(c) for each candidate noun c, the natural language processing engine 120 may determine the maximum attention score, for example, by at least computing the sum of the Hadamard product of the matrix A_(c) and the binary mask matrix M_(c). The maximum attention score may be further determined by computing the ratio of each Hadamard sum relative to the sum of the Hadamard sums for all of the m quantity of candidate nouns C={c₁, . . . , c_(m)}, as indicated by Equation (2) below:

$\mathrm{MAS}(c) = \frac{\sum\limits_{l,h} A_{c} \circ M_{c}}{\sum\limits_{c \in C} \sum\limits_{l,h} A_{c} \circ M_{c}} \in \left\lbrack 0,1 \right\rbrack. \qquad (2)$
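
For purposes of illustration only, the following sketch shows one way Equations (1) and (2) might be computed with NumPy, assuming the attention tensor has already been reduced to one matrix A_(c) of shape (L, H) per candidate noun (for example, the attention between the reference word s and that candidate across all layers and heads). The function name and the dictionary-based input format are illustrative conventions, not part of this disclosure.

```python
# Minimal sketch (an assumption-laden illustration, not the disclosed
# implementation): computing the maximum attention score MAS(c) of
# Equations (1) and (2) with NumPy. `attention` maps each candidate noun to
# an (L, H) matrix A_c of attention values between the reference word and
# that candidate, across all layers l and heads h.
from typing import Dict
import numpy as np

def maximum_attention_scores(attention: Dict[str, np.ndarray]) -> Dict[str, float]:
    candidates = list(attention)
    stacked = np.stack([attention[c] for c in candidates])      # (|C|, L, H)
    winner = stacked.argmax(axis=0)                              # argmax over candidates per (l, h)

    hadamard_sums = {}
    for idx, c in enumerate(candidates):
        mask = (winner == idx).astype(stacked.dtype)             # Equation (1): M_c(l, h)
        hadamard_sums[c] = float((attention[c] * mask).sum())    # sum over l, h of A_c ∘ M_c

    total = sum(hadamard_sums.values())
    # Equation (2): normalize each Hadamard sum by the sum over all candidates.
    return {c: hadamard_sums[c] / total for c in candidates}
```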

To further illustrate, FIG. 2B depicts an example of calculating a maximum attention score for the word sequence 150, in accordance with some example embodiments. As shown in FIG. 2B, the word sequence 150 may include a reference word s, a first candidate noun c₁, and a second candidate noun c₂. To perform the natural language processing task of coreference resolution, the natural language processing engine 120 may determine whether the reference word s refers to the first candidate noun c₁ or the second candidate noun c₂.

In some example embodiments, the machine learning model 115 trained to perform the language modeling task may generate the attention tensor A∈ℝ^(H×L×|C|) as a representation of the word sequence 150. The attention tensor A may include, for each of the m quantity of candidate nouns C={c₁, . . . , c_(m)} included in the word sequence 150, a matrix A_(c)∈ℝ^(H×L) corresponding to an attention map. For example, FIG. 2B shows the attention tensor A as including a first matrix A₁ corresponding to a first attention map for the first candidate noun c₁ and a second matrix A₂ corresponding to a second attention map for the second candidate noun c₂.

The first matrix A₁ may include a first plurality of attention values, each of which corresponds to a strength of a type of linguistic relationship between the first candidate noun c₁ and the reference word s. Meanwhile, the second matrix A₂ may include a second plurality of attention values, each of which corresponds to a strength of a type of linguistic relationship between the second candidate noun c₂ and the reference word s. The first plurality of attention values and the second plurality of attention values may be determined by the different attention heads h∈H occupying the various layers l∈L of the machine learning model 115.

The natural language processing engine 120 may calculate a maximum attention score for each of the first candidate noun c₁ and the second candidate noun c₂ including by generating a first masked matrix A₁∘M₁ for the first candidate noun c₁ and a second masked matrix A₂∘M₂ for the second candidate noun c₂. The first masked matrix A₁∘M₁ and the second masked matrix A₂∘M₂ may be generated by comparing the attention values occupying the corresponding entries in the first matrix A₁ and the second matrix A₂. For example, the first masked matrix A₁∘M₁ may be generated by applying a first binary mask matrix M₁ which, as shown in Equation (1) above, may be configured to preserve a first attention value in the first matrix A₁ if the first attention value is greater than a second attention value occupying a corresponding entry in the second matrix A₂. Contrastingly, the first attention value may become a zero value if the second attention value occupying the corresponding entry in the second matrix A₂ is greater than the first attention value.

In some example embodiments, the natural language processing engine 120 may determine, based at least on the first masked matrix A₁∘M₁, a first maximum attention score for the first candidate noun c₁. The natural language processing engine 120 may further determine, based at least on the second masked matrix A₂∘M₂, a second maximum attention score for the second candidate noun c₂. For instance, the first maximum attention score for the first candidate noun c₁ may be a ratio of a sum of the Hadamard product of the first matrix A₁ and the corresponding binary mask matrix M₁ relative to a sum of the Hadamard sums of every candidate noun including the first candidate noun c₁ and the second candidate noun c₂. Meanwhile, the second maximum attention score for the second candidate noun c₂ may be a ratio of a sum of the Hadamard product of the second matrix A₂ and the corresponding binary mask matrix M₂ relative to a sum of the Hadamard sums of every candidate noun included in the word sequence 150. The natural language processing engine 120 may identify the first candidate noun c₁ as being the antecedent of the reference word s if, for example, the first candidate noun c₁ is associated with a higher maximum attention score than the second candidate noun c₂.
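
As a worked illustration of this two-candidate comparison (with made-up attention values that are not taken from the disclosure or from FIG. 2B), the following NumPy snippet carries out the masking, Hadamard sums, and normalization end to end.

```python
# Illustrative numbers only: a two-candidate example of the FIG. 2B style
# computation using two small 2x2 attention matrices, one per candidate,
# indexed by (layer, head).
import numpy as np

A1 = np.array([[0.6, 0.2],
               [0.7, 0.1]])   # attention between reference word s and candidate c1
A2 = np.array([[0.3, 0.5],
               [0.2, 0.4]])   # attention between reference word s and candidate c2

# Equation (1): each mask keeps only the entries where its candidate wins.
M1 = (A1 > A2).astype(float)
M2 = (A2 > A1).astype(float)

# Hadamard products and sums, then Equation (2).
s1 = (A1 * M1).sum()          # 0.6 + 0.7 = 1.3
s2 = (A2 * M2).sum()          # 0.5 + 0.4 = 0.9
mas1 = s1 / (s1 + s2)         # ~0.59
mas2 = s2 / (s1 + s2)         # ~0.41

print(mas1, mas2)             # c1 has the higher score, so s would be resolved to c1
```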

FIG. 3 depicts examples of coreference resolution, in accordance with some example embodiments. As shown in FIG. 3, candidate nouns may be associated with different intensity highlights in order to visualize the differences in relative maximum attention scores between the candidate nouns. The maximum attention score associated with a candidate noun may indicate a strength of a connection to a reference word. That is, the higher the maximum attention score associated with the candidate noun, the stronger the connection between the candidate noun and the reference word as the antecedent being referred to by the reference word. Accordingly, the natural language processing engine 120 may identify a first candidate noun instead of a second candidate noun as being the antecedent of a reference word if the first candidate noun is associated with a higher maximum attention score than the second candidate noun.

FIG. 4 depicts a flowchart illustrating a process 400 for performing a natural language processing task, in accordance with some example embodiments. Referring to FIGS. 1A-B, 2A-B, 3, and 4, the process 400 may be performed by the natural language processing engine 120 in order to perform a natural language processing task such as, for example, coreference resolution.

At 402, the natural language processing engine 120 may apply a machine learning model trained to generate a representation of a word sequence including a first attention map and a second attention map. For example, the natural language processing engine 120 may receive, from the client 130, the word sequence 150 including the reference word s, the first candidate noun c₁, and the second candidate noun c₂. The word sequence 150 may be associated with a request for the natural language processing engine 120 to perform a natural language processing task such as coreference resolution, which may require the natural language processing engine 120 to determine whether the reference word s refers to the first candidate noun c₁ or the second candidate noun c₂.

In some example embodiments, the natural language processing engine 120 may apply the machine learning model 115, which may be trained by the machine learning controller 110 to perform a language modeling task such as generating the representation of the word sequence 150. The machine learning model 115 may be a bidirectional encoder representations from transformers (BERT) model, which may include a transformer having multiple attention heads h∈H occupying multiple layers l∈L. Each attention head h included in the machine learning model 115 may be configured to generate attention values corresponding to the strength of a particular type of linguistic relationship between various pairs of words in the word sequence 150. Examples of the linguistic relationships that may exist between two words may include direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences.

For example, as shown in FIG. 2B, the machine learning model 115 trained to perform the language modeling task may generate, as a representation of the word sequence 150, the attention tensor A∈ℝ^(H×L×|C|) including a first matrix A₁ corresponding to a first attention map for the first candidate noun c₁ and a second matrix A₂ corresponding to a second attention map for the second candidate noun c₂. The first matrix A₁ may include a first plurality of attention values corresponding to a strength of various linguistic relationships between the first candidate noun c₁ and the reference word s. Meanwhile, the second matrix A₂ may include a second plurality of attention values corresponding to a strength of various linguistic relationships between the second candidate noun c₂ and the reference word s.

At 404, the natural language processing engine 120 may determine, based at least on the first attention map and the second attention map, a first maximum attention score for a first candidate noun associated with the first attention map and a second maximum attention score for a second candidate noun associated with the second attention map. The natural language processing engine 120 may calculate a maximum attention score for each of the first candidate noun c₁ and the second candidate noun c₂ including by generating a first masked matrix A₁∘M₁ for the first candidate noun c₁ and a second masked matrix A₂∘M₂ for the second candidate noun c₂. For example, the first masked matrix A₁∘M₁ may be generated by applying a first binary mask matrix M₁ configured to preserve a first attention value in the first matrix A₁ if the first attention value is greater than a second attention value occupying a corresponding entry in the second matrix A₂. Contrastingly, the first attention value may become a zero value if the second attention value occupying the corresponding entry in the second matrix A₂ is greater than the first attention value.

In some example embodiments, the natural language processing engine 120 may determine a first maximum attention score for the first candidate noun c₁, which may be a ratio of a sum of the Hadamard product of the first matrix A₁ and the corresponding binary mask matrix M₁ relative to a sum of the Hadamard sums of all of the candidate nouns, including the second candidate noun c₂. Moreover, the natural language processing engine 120 may determine a second maximum attention score for the second candidate noun c₂, which may be a ratio of a sum of the Hadamard product of the second matrix A₂ and the corresponding binary mask matrix M₂ relative to a sum of the Hadamard sums of all of the candidate nouns, including the first candidate noun c₁.

At 406, the natural language processing engine 120 may perform, based at least on the first maximum attention score and the second maximum attention score, a natural language processing task. For instance, the natural language processing engine 120 may be configured to perform coreference resolution, which may require the natural language processing engine 120 to determine whether the reference word s refers to the first candidate noun c₁ or the second candidate noun c₂. According to some example embodiments, the natural language processing engine 120 may perform, based at least on the first maximum attention score of the first candidate noun c₁ and the second maximum attention score of the second candidate noun c₂, the coreference resolution. For example, the natural language processing engine 120 may determine that the first candidate noun c₁ is the antecedent of the reference word s based at least on the first candidate noun c₁ having a higher maximum attention score than the second candidate noun c₂.
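
As a brief illustration of this decision step, the following snippet (with illustrative scores only, not values from the disclosure) selects the candidate noun with the highest maximum attention score as the antecedent of the reference word s.

```python
# Illustrative values only: resolve the coreference by picking the candidate
# noun whose maximum attention score from Equation (2) is largest.
mas_scores = {"c1": 0.59, "c2": 0.41}            # e.g., output of Equation (2)
antecedent = max(mas_scores, key=mas_scores.get)
print(antecedent)                                 # "c1", i.e., s refers to c1
```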

FIG. 5 depicts a block diagram illustrating a computing system 500, in accordance with some example embodiments. Referring to FIGS. 1A and 5, the computing system 500 can be used to implement the machine learning controller 110, the natural language processing engine 120, and/or any components therein.

As shown in FIG. 5, the computing system 500 can include a processor 510, a memory 520, a storage device 530, and input/output devices 540. The processor 510, the memory 520, the storage device 530, and the input/output devices 540 can be interconnected via a system bus 550. The processor 510 is capable of processing instructions for execution within the computing system 500. Such executed instructions can implement one or more components of, for example, the machine learning controller 110 and the natural language processing engine 120. In some implementations of the current subject matter, the processor 510 can be a single-threaded processor. Alternately, the processor 510 can be a multi-threaded processor. The processor 510 is capable of processing instructions stored in the memory 520 and/or on the storage device 530 to display graphical information for a user interface provided via the input/output device 540.

The memory 520 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 500. The memory 520 can store data structures representing configuration object databases, for example. The storage device 530 is capable of providing persistent storage for the computing system 500. The storage device 530 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage means. The input/output device 540 provides input/output operations for the computing system 500. In some implementations of the current subject matter, the input/output device 540 includes a keyboard and/or pointing device. In various implementations, the input/output device 540 includes a display unit for displaying graphical user interfaces.

According to some implementations of the current subject matter, the input/output device 540 can provide input/output operations for a network device. For example, the input/output device 540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).

In some implementations of the current subject matter, the computing system 500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various (e.g., tabular) formats (e.g., Microsoft Excel®, and/or any other type of software). Alternatively, the computing system 500 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities (e.g., SAP Integrated Business Planning add-in for Microsoft Excel as part of the SAP Business Suite, as provided by SAP SE, Walldorf, Germany) or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 540. The user interface can be generated and presented to a user by the computing system 500 (e.g., on a computer screen monitor, etc.).

One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.

To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. For example, the logic flows may include different and/or additional operations than shown without departing from the scope of the present disclosure. One or more operations of the logic flows may be repeated and/or omitted without departing from the scope of the present disclosure. Other implementations may be within the scope of the following claims.

What is claimed is:
1. A system, comprising: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not a second reference noun; and generating a user interface displaying a result of the natural language processing task.
2. The system of claim 1, further comprising: generating a first masked matrix by at least applying, to the first attention map, a first binary mask matrix, the first masked matrix generated to include an entry having a first attention value in response to the first attention value occupying the entry in the first attention map being greater than a second attention value occupying a corresponding entry in the second attention map.
3. The system of claim 2, further comprising: generating a second masked matrix by at least applying, to the second attention map, a second binary mask matrix, the second masked matrix generated to include a zero value in the corresponding entry in response to the second attention value occupying the corresponding entry in the second attention map being less than the first attention value occupying the entry in the first attention map.

4. The system of claim 3, wherein a first maximum attention score comprises a ratio of a first Hadamard sum associated with the first candidate noun relative to a sum of the first Hadamard sum and a second Hadamard sum associated with the second candidate noun, wherein the first Hadamard sum is determined based on the first masked matrix, and wherein the second Hadamard sum is determined based on the second masked matrix.

5. The system of claim 1, wherein the plurality of linguistic relationships include direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences.
6. The system of claim 5, wherein the machine learning model includes a plurality of attention heads, and wherein each of the plurality of attention heads is configured to determine a strength of one of the plurality of linguistic relationships between each pair of words present in the word sequence.

7. The system of claim 1, wherein the representation of the word sequence comprises an attention tensor that includes the first attention map and the second attention map.
8. The system of claim 1, wherein the machine learning model comprises a bidirectional encoder representations from transformers model.
9. The system of claim 1, further comprising: training the machine learning model to generate the representation of the word sequence including by training the machine learning model to perform a masked language modeling and/or a next sentence prediction.

10. The system of claim 9, wherein the machine learning model is trained, based at least on unlabeled training data, to generate the representation of the word sequence.
11. A computer-implemented method, comprising: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not a second reference noun; and generating a user interface displaying a result of the natural language processing task.
12. The computer-implemented method of claim 11, further comprising: generating a first masked matrix by at least applying, to the first attention map, a first binary mask matrix, the first masked matrix generated to include an entry having a first attention value in response to the first attention value occupying the entry in the first attention map being greater than a second attention value occupying a corresponding entry in the second attention map; and generating a second masked matrix by at least applying, to the second attention map, a second binary mask matrix, the second masked matrix generated to include a zero value in the corresponding entry in response to the second attention value occupying the corresponding entry in the second attention map being less than the first attention value occupying the entry in the first attention map.
13. The computer-implemented method of claim 12, wherein a first maximum attention score comprises a ratio of a first Hadamard sum associated with the first candidate noun relative to a sum of the first Hadamard sum and a second Hadamard sum associated with the second candidate noun, wherein the first Hadamard sum is determined based on the first masked matrix, and wherein the second Hadamard sum is determined based on the second masked matrix.
14. The computer-implemented method of claim 11, wherein the plurality of linguistic relationships include direct objects, noun modifiers, possessive pronouns, passive auxiliary verbs, prepositions, and coreferences.
15. The computer-implemented method of claim 14, wherein the machine learning model includes a plurality of attention heads, and wherein each of the plurality of attention heads is configured to determine a strength of one of the plurality of linguistic relationships between each pair of words present in the word sequence.

16. The computer-implemented method of claim 11, wherein the representation of the word sequence comprises an attention tensor that includes the first attention map and the second attention map.
17. The computer-implemented method of claim 11, wherein the machine learning model comprises a bidirectional encoder representations from transformers model.
18. The computer-implemented method of claim 11, further comprising: training the machine learning model to generate the representation of the word sequence including by training the machine learning model to perform a masked language modeling and/or a next sentence prediction.
19. The computer-implemented method of claim 18, wherein the machine learning model is trained, based at least on unlabeled training data, to generate the representation of the word sequence.
20. A non-transitory computer readable medium storing instructions, which when executed by at least one data processor, result in operations comprising: applying a machine learning model trained to generate a representation of a word sequence, the word sequence including a reference word, a first candidate noun, and a second candidate noun, the representation of the word sequence including a first attention map and a second attention map, the first attention map including a first plurality of attention values indicative of a first strength of a plurality of linguistic relationships between the reference word and the first candidate noun, and the second attention map including a second plurality of attention values indicative of a second strength of the plurality of linguistic relationships between the reference word and the second candidate noun; performing, based at least on the first attention map and the second attention map, a natural language processing task, the natural language processing task including determining that the reference word refers to the first candidate noun and not a second reference noun; and generating a user interface displaying a result of the natural language processing task.